This is an archive for R-sig-geo, not a forum. It is not possible to post through Nabble - you may not start a new thread nor follow up an existing thread. If you wish to post, but are not subscribed to the list through the list homepage, subscribe first (through the list homepage, not via Nabble) and post from your subscribed email address. Until 2015-06-20, subscribers could post through Nabble, but policy has been changed as too many non-subscribers misunderstood the interface.
Updated: 43 min 5 sec ago

### problems in anisotropy detection

Sat, 07/13/2019 - 00:32
Dear friends,

I have a dataset of hydraulic heads, mydata = <X, Y, HH, Z> and i would
like to check for anisotropy.

In R I found 3 functions to carry out such task:

1. likfit

x_geodata <- as.geodata(mydata, coord.cols = 1:2, data.col = 3,
covar.col = 4)
fit_mle  <- likfit(x_geodata,
fix.nugget = FALSE,
cov.model = "exponential", psiA = 0, psiR = 1,
ini = c(var(Y), 1), fix.psiA = FALSE, fix.psiR =
FALSE,
hessian = TRUE)

that detects no anisotropy.

2. estimateAnisotropy

mydata.sp <- mydata
coordinates(mydata.sp) = ~ X + Y
estimateAnisotropy(mydata.sp, depVar = "LivStat", "LivStat ~ Z")

that returns the following

[generalized least squares trend estimation]
$ratio [1] 1.340775$direction
[1] -35.29765

$Q Q11 Q22 Q12 [1,] 1.926136e-05 2.329241e-05 5.721893e-06$doRotation
[1] TRUE

finally,

3. estiStAni

vmod1 <- fit.variogram(vgm1, vgm(18, "Ste", 1300, 0.78, kappa = 1.7))
estiStAni(vgm1, c(10, 150), "vgm", vmod1)

that returns an error:

Error in $<-.data.frame(*tmp*, "dir.hor", value = 0) : replacement has 1 row, data has 0. I am very puzzled, can anyone help me understanding if there is anisotropy in my dataset? thanks emanuele _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo ### Re: LR1.sarlm specifications Thu, 07/11/2019 - 09:33 Dear Prof. Roger Bivand, Thank you very much for clarifying my doubt. This helped me a great deal! On Thu, Jul 11, 2019 at 12:58 PM Roger Bivand <[hidden email]> wrote: > On Thu, 11 Jul 2019, Amitha Puranik wrote: > > > Hi Prof. Roger, > > > > I am using the approach proposed by Prof Paul Erhorst for choosing a > > spatial model in his paper '*Applied spatial econometrics: raising the > > bar*' . As per the strategy, one has to check the likelihood ratio test > > for theta (spatial autocorrelation in exogenous (independent) variables) > > and also in theta+rho*beta (spatial autocorrelation in residuals). > > Suppose I fit a spatial durbin model and use the code LR1.sarlm(sp.dm), > > how would I know whether the likelihood ratio test checks for > > autocorrelation in dependent variable or autocorrelation in the > > independent variable? > > The spatialreg::LR1.sarlm() test simply between the fitted model and the > same model assuming the spatial coefficients are zero, so it only tests > the possible benefit of including (a) spatial process(es). > spatialreg::LR.sarlm() lets you test between nested models, and works like > lmtest::lrtest(). The models need to be nested, so you can test SDM/SLM, > SDM/SEM (equivalent to a Common Factor test), and so on, but only if the > models nest (not SEM/SLM, because they do not nest). > > Hope this helps, > > Roger > > > > > Thanks in advance. > > Amitha Puranik. > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > R-sig-Geo mailing list > > [hidden email] > > https://stat.ethz.ch/mailman/listinfo/r-sig-geo > > > > -- > Roger Bivand > Department of Economics, Norwegian School of Economics, > Helleveien 30, N-5045 Bergen, Norway. > voice: +47 55 95 93 55; e-mail: [hidden email] > https://orcid.org/0000-0003-2392-6140 > https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en > [[alternative HTML version deleted]] _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo ### PhD course ECS530, Bergen, 2-7 December Thu, 07/11/2019 - 02:32 A PhD-level course in spatial data analysis (ECS530, Auditorium C) will be held 2-7 December 2019 in Bergen: https://www.nhh.no/en/courses/analysing-spatial-data/ External participants should apply using this form: https://www.nhh.no/en/study-programmes/phd-programme-at-nhh/phd-courses/application-to-follow-phd-courses-at-nhh/ For further details see the course page. Participants must cover their own travel and living costs; no support is offered, I'm afraid, apart from free tuition. Roger -- Roger Bivand Department of Economics, Norwegian School of Economics, Helleveien 30, N-5045 Bergen, Norway. voice: +47 55 95 93 55; e-mail: [hidden email] https://orcid.org/0000-0003-2392-6140 https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo Roger Bivand Department of Economics Norwegian School of Economics Helleveien 30 N-5045 Bergen, Norway ### Re: LR1.sarlm specifications Thu, 07/11/2019 - 02:27 On Thu, 11 Jul 2019, Amitha Puranik wrote: > Hi Prof. Roger, > > I am using the approach proposed by Prof Paul Erhorst for choosing a > spatial model in his paper '*Applied spatial econometrics: raising the > bar*' . As per the strategy, one has to check the likelihood ratio test > for theta (spatial autocorrelation in exogenous (independent) variables) > and also in theta+rho*beta (spatial autocorrelation in residuals). > Suppose I fit a spatial durbin model and use the code LR1.sarlm(sp.dm), > how would I know whether the likelihood ratio test checks for > autocorrelation in dependent variable or autocorrelation in the > independent variable? The spatialreg::LR1.sarlm() test simply between the fitted model and the same model assuming the spatial coefficients are zero, so it only tests the possible benefit of including (a) spatial process(es). spatialreg::LR.sarlm() lets you test between nested models, and works like lmtest::lrtest(). The models need to be nested, so you can test SDM/SLM, SDM/SEM (equivalent to a Common Factor test), and so on, but only if the models nest (not SEM/SLM, because they do not nest). Hope this helps, Roger > > Thanks in advance. > Amitha Puranik. > > [[alternative HTML version deleted]] > > _______________________________________________ > R-sig-Geo mailing list > [hidden email] > https://stat.ethz.ch/mailman/listinfo/r-sig-geo > -- Roger Bivand Department of Economics, Norwegian School of Economics, Helleveien 30, N-5045 Bergen, Norway. voice: +47 55 95 93 55; e-mail: [hidden email] https://orcid.org/0000-0003-2392-6140 https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo Roger Bivand Department of Economics Norwegian School of Economics Helleveien 30 N-5045 Bergen, Norway ### LR1.sarlm specifications Wed, 07/10/2019 - 22:15 Hi Prof. Roger, I am using the approach proposed by Prof Paul Erhorst for choosing a spatial model in his paper '*Applied spatial econometrics: raising the bar*' . As per the strategy, one has to check the likelihood ratio test for theta (spatial autocorrelation in exogenous (independent) variables) and also in theta+rho*beta (spatial autocorrelation in residuals). Suppose I fit a spatial durbin model and use the code LR1.sarlm(sp.dm), how would I know whether the likelihood ratio test checks for autocorrelation in dependent variable or autocorrelation in the independent variable? Thanks in advance. Amitha Puranik. [[alternative HTML version deleted]] _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo ### Re: geoJSON and leaflet Tue, 07/09/2019 - 09:47 v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} Hi Tim, Thanks again for your response ! It fit perfectly ! Best Ben De : Tim Salabim <[hidden email]> Envoyé : mardi 9 juillet 2019 16:25 À : Benjamin Pauget <[hidden email]>; r-sig-geo <[hidden email]> Objet : Re: [R-sig-Geo] geoJSON and leaflet Hi Benjamin, please respond to the list as well. Other people may have similar issues and this way we can find the solutions online. If you want to extract the data along with the geometries, here's how you could do it: library(leaflet) library(jsonlite) library(sf) url <- "http://www.georisques.gouv.fr/api/v1/sis?rayon=1000&latlon=4.854899%2C%2045.763079&page=1&page_size=10" geojson <- jsonlite::fromJSON(url) # convert coordinate arrays to matrices geom = lapply(geojson$data$geom$coordinates, matrix, ncol = 2)
# create multipolygon from coordinate matrices
geom = lapply(geom, function(i) st_polygon(list(i)))
# overwrite the geom column of data in geojson
geojson$data$geom = st_sfc(geom, crs = 4326)
# exract data from geojson and turn into an sf object
dat = st_as_sf(geojson$data) leaflet() %>% addTiles()%>% setView(lng = 4.854899, lat = 45.763079, zoom = 14) %>% addMarkers(lng = 4.872536, lat = 45.758321)%>% addPolygons(data = dat, popup = ~nom) This way, as you can see, you can use the tilde (~) notation to include popups referring to a column of the data. I don't think that what you get using RSONIO is any closer to a valid geojson. Best Tim On Tue, Jul 9, 2019 at 3:28 PM Benjamin Pauget <[hidden email]> wrote: Dear Tim, Tanks you soooo much for your response ! I am a beginner and working with geospatial is still unclear. I have tried to open my json file with “from_JSON” and I have another format of my data : geojson4 <- RJSONIO::fromJSON(url) Do you think this format is more relevant and will allow to display the pop’up info ? Thank you again for your code (I just pass the two last day on trying to display the polygons) ! Best regards, Ben De : Tim Salabim <[hidden email]> Envoyé : mardi 9 juillet 2019 15:14 À : Benjamin Pauget <[hidden email]> Cc : [hidden email] Objet : Re: [R-sig-Geo] geoJSON and leaflet Hi Benjamin, What you get back from the jsonlite::fromJSON() call is a simplified list with the data (including the geometry information) as a data frame in one of the list slots. GeoJson is usually a character string. Therefore, if you open your map in the browser and open the console (Ctrl + i) you will see the error message: "Invalid GeoJson object". I am no expert on geojson structure, but it seems that the data that you request is not in standard format. I tried a few different ways of parsing the data to a valid GeoJson string but did not have success. Maybe someone else with more insight has some helpful ideas how to achieve this. However, I found a workaround to get the polygons shown on the map using library(sf) - see code below. Note, this only visualises the geometry information only, so direct popup queries via "~" are not possible. library(leaflet) library(jsonlite) library(sf) url <- "http://www.georisques.gouv.fr/api/v1/sis?rayon=1000&latlon=4.854899%2C%2045.763079&page=1&page_size=10" geojson <- jsonlite::fromJSON(url) # convert coordinate arrays to matrices geom = lapply(geojson$data$geom$coordinates, matrix, ncol = 2)
# create multipolygon from coordinate matrices
geom = st_cast(st_polygon(geom), "MULTIPOLYGON")

leaflet() %>%
setView(lng = 4.854899, lat = 45.763079, zoom = 14) %>%
addMarkers(lng = 4.872536, lat = 45.758321)%>%

HTH,

Tim

On Tue, Jul 9, 2019 at 1:49 PM Benjamin Pauget <[hidden email]> wrote:

Hi,

I’m writting because I have some trouble with a geoJSON file and a leaflet.

I’m trying to display polygon on a leaflet map, but nothing append ☹

I have no error message.

Best regards

Here is my code :

library(leaflet)

library(jsonlite)

geojson <- jsonlite::fromJSON(url)

leaflet() %>%

setView(lng = 4.854899, lat = 45.763079, zoom = 14) %>%

addMarkers(lng = 4.872536, lat = 45.758321)%>%

addGeoJSON(geojson) # doesn’t work by using geojson$data or geojson$data$geom Benjamin PAUGET Responsable R&D +33 (0)1 81 94 13 70 +33 (0)6 47 01 85 92 Le Visium 22 Av Aristide Briand 94110 ARCUEIL https://tesora.fr/ Linkedin _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo ### Re: geoJSON and leaflet Tue, 07/09/2019 - 09:24 Hi Benjamin,please respond to the list as well. Other people may have similar issues and this way we can find the solutions online. If you want to extract the data along with the geometries, here's how you could do it: library(leaflet) library(jsonlite) library(sf) url <- "http://www.georisques.gouv.fr/api/v1/sis?rayon=1000&latlon=4.854899%2C%2045.763079&page=1&page_size=10" geojson <- jsonlite::fromJSON(url) # convert coordinate arrays to matrices geom = lapply(geojson$data$geom$coordinates, matrix, ncol = 2)
# create multipolygon from coordinate matrices
geom = lapply(geom, function(i) st_polygon(list(i)))
# overwrite the geom column of data in geojson
geojson$data$geom = st_sfc(geom, crs = 4326)
# exract data from geojson and turn into an sf object
dat = st_as_sf(geojson$data) leaflet() %>% addTiles()%>% setView(lng = 4.854899, lat = 45.763079, zoom = 14) %>% addMarkers(lng = 4.872536, lat = 45.758321)%>% addPolygons(data = dat, popup = ~nom) This way, as you can see, you can use the tilde (~) notation to include popups referring to a column of the data. I don't think that what you get using RSONIO is any closer to a valid geojson. BestTim On Tue, Jul 9, 2019 at 3:28 PM Benjamin Pauget <[hidden email]> wrote: Dear Tim, Tanks you soooo much for your response ! I am a beginner and working with geospatial is still unclear. I have tried to open my json file with “from_JSON” and I have another format of my data : geojson4 <- RJSONIO::fromJSON(url) Do you think this format is more relevant and will allow to display the pop’up info ? Thank you again for your code (I just pass the two last day on trying to display the polygons) ! Best regards, Benjamin PAUGET Responsable R&D +33 (0)1 81 94 13 70 +33 (0)6 47 01 85 92 Le Visium 22 Av Aristide Briand 94110 ARCUEIL De : Tim Salabim <[hidden email]> Envoyé : mardi 9 juillet 2019 15:14 À : Benjamin Pauget <[hidden email]> Cc : [hidden email] Objet : Re: [R-sig-Geo] geoJSON and leaflet Hi Benjamin, What you get back from the jsonlite::fromJSON() call is a simplified list with the data (including the geometry information) as a data frame in one of the list slots. GeoJson is usually a character string. Therefore, if you open your map in the browser and open the console (Ctrl + i) you will see the error message: "Invalid GeoJson object". I am no expert on geojson structure, but it seems that the data that you request is not in standard format. I tried a few different ways of parsing the data to a valid GeoJson string but did not have success. Maybe someone else with more insight has some helpful ideas how to achieve this. However, I found a workaround to get the polygons shown on the map using library(sf) - see code below. Note, this only visualises the geometry information only, so direct popup queries via "~" are not possible. library(leaflet) library(jsonlite) library(sf) url <- "http://www.georisques.gouv.fr/api/v1/sis?rayon=1000&latlon=4.854899%2C%2045.763079&page=1&page_size=10" geojson <- jsonlite::fromJSON(url) # convert coordinate arrays to matrices geom = lapply(geojson$data$geom$coordinates, matrix, ncol = 2)
# create multipolygon from coordinate matrices
geom = st_cast(st_polygon(geom), "MULTIPOLYGON")

leaflet() %>%
setView(lng = 4.854899, lat = 45.763079, zoom = 14) %>%
addMarkers(lng = 4.872536, lat = 45.758321)%>%

HTH,

Tim

On Tue, Jul 9, 2019 at 1:49 PM Benjamin Pauget <[hidden email]> wrote:

Hi,

I’m writting because I have some trouble with a geoJSON file and a leaflet.

I’m trying to display polygon on a leaflet map, but nothing append ☹

I have no error message.

Best regards

Here is my code :

library(leaflet)

library(jsonlite)

geojson <- jsonlite::fromJSON(url)

leaflet() %>%

setView(lng = 4.854899, lat = 45.763079, zoom = 14) %>%

addMarkers(lng = 4.872536, lat = 45.758321)%>%

addGeoJSON(geojson) # doesn’t work by using geojson$data or geojson$data$geom Benjamin PAUGET Responsable R&D +33 (0)1 81 94 13 70 +33 (0)6 47 01 85 92 Le Visium 22 Av Aristide Briand 94110 ARCUEIL _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo ### Re: geoJSON and leaflet Tue, 07/09/2019 - 08:13 Hi Benjamin,What you get back from the jsonlite::fromJSON() call is a simplified list with the data (including the geometry information) as a data frame in one of the list slots. GeoJson is usually a character string. Therefore, if you open your map in the browser and open the console (Ctrl + i) you will see the error message: "Invalid GeoJson object". I am no expert on geojson structure, but it seems that the data that you request is not in standard format. I tried a few different ways of parsing the data to a valid GeoJson string but did not have success. Maybe someone else with more insight has some helpful ideas how to achieve this. However, I found a workaround to get the polygons shown on the map using library(sf) - see code below. Note, this only visualises the geometry information only, so direct popup queries via "~" are not possible. library(leaflet) library(jsonlite) library(sf) url <- "http://www.georisques.gouv.fr/api/v1/sis?rayon=1000&latlon=4.854899%2C%2045.763079&page=1&page_size=10" geojson <- jsonlite::fromJSON(url) # convert coordinate arrays to matrices geom = lapply(geojson$data$geom$coordinates, matrix, ncol = 2)
# create multipolygon from coordinate matrices
geom = st_cast(st_polygon(geom), "MULTIPOLYGON")

leaflet() %>%
setView(lng = 4.854899, lat = 45.763079, zoom = 14) %>%
addMarkers(lng = 4.872536, lat = 45.758321)%>%
HTH,Tim

On Tue, Jul 9, 2019 at 1:49 PM Benjamin Pauget <[hidden email]> wrote:

Hi,

I’m writting because I have some trouble with a geoJSON file and a leaflet.

I’m trying to display polygon on a leaflet map, but nothing append ☹

I have no error message.

Best regards

Here is my code :

library(leaflet)

library(jsonlite)

geojson <- jsonlite::fromJSON(url)

leaflet() %>%

setView(lng = 4.854899, lat = 45.763079, zoom = 14) %>%

addMarkers(lng = 4.872536, lat = 45.758321)%>%

addGeoJSON(geojson) # doesn’t work by using geojson$data or geojson$data$geom Benjamin PAUGET Responsable R&D +33 (0)1 81 94 13 70 +33 (0)6 47 01 85 92 Le Visium 22 Av Aristide Briand 94110 ARCUEIL _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo ### geoJSON and leaflet Tue, 07/09/2019 - 06:48 v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} Hi, I’m writting because I have some trouble with a geoJSON file and a leaflet. I’m trying to display polygon on a leaflet map, but nothing append ☹ I have no error message. Do you have some advice/ideas? Best regards Here is my code : library(leaflet) library(jsonlite) url <- "http://www.georisques.gouv.fr/api/v1/sis?rayon=1000&latlon=4.854899%2C%2045.763079&page=1&page_size=10" geojson <- jsonlite::fromJSON(url) leaflet() %>% addTiles()%>% setView(lng = 4.854899, lat = 45.763079, zoom = 14) %>% addMarkers(lng = 4.872536, lat = 45.758321)%>% addGeoJSON(geojson) # doesn’t work by using geojson$data or geojson$data$geom

Benjamin PAUGET

Responsable R&D

+33 (0)1 81 94 13 70

+33 (0)6 47 01 85 92

Le Visium

22 Av Aristide Briand

94110 ARCUEIL

https://tesora.fr/

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### Re: Error in predict.sarlm: non-unique row.names given

Sun, 07/07/2019 - 22:32
Do provide a complete reproducible example. I really appeal to all posting
questions to give potential helpers something to work on. Asking for
reproducible examples is the absolutely dominant response to postings that
lack them, if they get any response at all.

misunderstanding:

train <- col[col$EW == 1,] test <- col[col$EW == 0,]
col.nb <- spdep::poly2nb(col)
train.nb <- spdep::poly2nb(train)
test.nb <- spdep::poly2nb(test)
attr(col.nb, "region.id")
attr(train.nb, "region.id")
attr(test.nb, "region.id")
train.mod <- lagsarlm(CRIME ~ INC + HOVAL, data=train,
listw=spdep::nb2listw(train.nb))
try(preds <- predict(train.mod, newdata=test,
listw=spdep::nb2listw(test.nb)))
preds[2]
try(preds1 <- predict(train.mod, newdata=col,
listw=spdep::nb2listw(col.nb)))
# warning

preds1[4]
try(preds2 <- predict(train.mod, newdata=test,
listw=spdep::nb2listw(col.nb)))
preds2[2]

Using the complete set of weights permits the spatial process to flow
between neighbouring members of train/test sets.

Your problem is probably that your two data objects do not use row.names
as expected:

attr(test.nb, "region.id") <- as.character(1:length(test.nb))
attr(train.nb, "region.id") <- as.character(1:length(train.nb))
train.mod1 <- lagsarlm(CRIME ~ INC + HOVAL, data=train,
listw=spdep::nb2listw(train.nb))
try(preds3 <- predict(train.mod, newdata=test,
listw=spdep::nb2listw(test.nb)))
# Error in predict.sarlm(train.mod, newdata = test, listw =
# spdep::nb2listw(test.nb)) :
#   mismatch between newdata and spatial weights. newdata should have
# region.id as row.names

as is obvious. So when the predict method is trying to assign the newdata
neighbours (it needs to identify the correct rows in newdata based on the
"region.id" attribute of the provided weights), it fails as described.

Use the whole data weights when predicting for the test set newdata=, or
if the two graphs do not neighbour each other, that is train.nb is
separate from test.nb (think two islands), make sure that the region.ids
and row.names do not overlap between test and train sets.

Goulard et al. (2017), and the help page, and report back. Remember that
you can only predict for a test set of reasonable size (because as you see
from the underlying article, you probably need an inverted nxn matrix in
the spatial lag model case).

Hope this clarifies

Roger

On Mon, 8 Jul 2019, Jiawen Ng wrote:

> Another question on predict.sarlm!
>
> Here is the line of code that is producing the error:
> pred <- spatialreg::predict.sarlm(model, df, test.listw,zero.policy = T)
>
> Here is the error:
>
> Error in mat2listw(W, row.names = region.id.mixed, style = style) :
>  non-unique row.names given
> 1: In spatialreg::predict.sarlm(model, df, test.listw,  :
>  some region.id are both in data and newdata
> 2: In subset(attr(listw.mixed, "region.id"), attr(listw.mixed, "region.id")
> %in%  :
>  longer object length is not a multiple of shorter object length
>
> Any idea how I can solve the non-unique row.names error?
>
> Thank you!
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Roger Bivand
Department of Economics
Norwegian School of Economics
Helleveien 30
N-5045 Bergen, Norway

### Error in predict.sarlm: non-unique row.names given

Sun, 07/07/2019 - 19:02
Another question on predict.sarlm!

Here is the line of code that is producing the error:
pred <- spatialreg::predict.sarlm(model, df, test.listw,zero.policy = T)

Here is the error:

Error in mat2listw(W, row.names = region.id.mixed, style = style) :
non-unique row.names given
1: In spatialreg::predict.sarlm(model, df, test.listw,  :
some region.id are both in data and newdata
2: In subset(attr(listw.mixed, "region.id"), attr(listw.mixed, "region.id")
%in%  :
longer object length is not a multiple of shorter object length

Any idea how I can solve the non-unique row.names error?

Thank you!

[[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### Split RGB raster in multiples pieces and save file in GeoTiff with conditions

Sun, 07/07/2019 - 17:21
Dear R-Sig-Geo Members,

?????? I've like to split a RGB raster in multiple pieces and save each
image in GeoTiff format, but in the file name I need the count of points
(pts) inside each piece. My code runs perfectly when I used a single
raster image, but with stack or brick object doesn't work. I've like as
final output in my directory nine GeoTiff files (I divided the original
raster in 9 parts) with files names: SplitRas1points0.tif ...
SplitRas9points0.tif

?????? My code is:

#Packages
library(raster)
library(sp)
library(rgdal)

## Create a simulated RBG rasters
r <- raster(nc=30, nr=30)
r <- setValues(r, round(runif(ncell(r))* 255))
g <- raster(nc=30, nr=30)
g <- setValues(r, round(runif(ncell(r))* 255))
b <- raster(nc=30, nr=30)
b <- setValues(r, round(runif(ncell(r))* 255))
rgb<-stack(r,g,b)
plotRGB(rgb,
?????????????? r = 1, g = 2, b = 3)

##Given interesting points coordinates
xd???????? <- c(-24.99270,45.12069,99.40321,73.64419)
yd?? <- c(-45.435267,-88.369745,-7.086949,44.174530)
pts <- data.frame(xd,yd)
pts_s<- SpatialPoints(pts)
points(pts_s, col="black", pch=16)

# This function spatially aggregates the original raster
# it turns each aggregated cell into a polygon
# then the extent of each polygon is used to crop
# the original raster.
# The function returns a list with all the pieces
# it saves and plots each piece
# The arguments are:
# raster = raster to be chopped?????????????????????? (raster object)
# ppside = pieces per side???????????????????????????????? (integer)
# save???? = write raster?????????????????????????????????????? (TRUE or FALSE)
# plot???? = do you want to plot the output? (TRUE or FALSE)
SplitRas <- function(raster,ppside,save,plot){
?? h?????????????? <- ceiling(ncol(raster)/ppside)
?? v?????????????? <- ceiling(nrow(raster)/ppside)
?? agg?????????? <- aggregate(raster,fact=c(h,v))
?? agg[]?????? <- 1:ncell(agg)
?? agg_poly <- rasterToPolygons(agg)
?? names(agg_poly) <- "polis"
?? r_list <- list()
?? for(i in 1:ncell(agg)){
?????? e1?????????????????? <- extent(agg_poly[agg_poly\$polis==i,])
?????? r_list[[i]] <- crop(raster,e1)
?? }
?? if(save==T){
?????? for(i in 1:length(r_list)){
?????????? writeRaster(r_list[[i]],filename=paste("SplitRas",i,sep=""),
?????????????????????????????????? format="GTiff",datatype="FLT4S",overwrite=TRUE)
?????? }
?? }
?? if(plot==T){
?????? par(mfrow=c(ppside,ppside))
?????? for(i in 1:length(r_list)){
?????????? plot(r_list[[i]],axes=F,legend=F,bty="n",box=FALSE)
?????? }
?? }
?? return(r_list)
}

#Slip RGB raster in 3 parts
splitRBG<-SplitRas(raster=rgb,ppside=3,save=FALSE,plot=FALSE)

#Count number of points inside each piece of raster
res<-NULL
for(i in 1:3){
?? pointcount = function(r, pts){
?? # make a raster of zeroes like the input
?? r2 = r
?? r2[] = 0
?? # get the cell index for each point and make a table:
?? counts = table(cellFromXY(r,pts))
?? # fill in the raster with the counts from the cell index:
?? r2[as.numeric(names(counts))] = counts
?? return(r2)
??}
r2 = pointcount(splitRBG[i], pts_s)
res<-rbind(res,c(r2))
}
#

#Run a code on each piece with the number of points inside each piece in
the file name (res[i])
list2 <- list()
for(i in 1:3){ # 3 is the number of pieces
?? rx <- raster(paste("splitRBG",i,".tif",sep=""))
writeRaster(splitRBG,filename=paste("SplitRas",i,"points",res[i],sep=""),
?????????????????????????? format="GTiff",datatype="FLT4S",overwrite=TRUE)
}
#

Alexandre

--
======================================================================
Alexandre dos Santos
Prote????o Florestal
IFMT - Instituto Federal de Educa????o, Ci??ncia e Tecnologia de Mato Grosso
Campus C??ceres
Caixa Postal 244
Avenida dos Ramires, s/n
Bairro: Distrito Industrial
C??ceres - MT                      CEP: 78.200-000
Fone: (+55) 65 99686-6970 (VIVO) (+55) 65 3221-2674 (FIXO)

[hidden email]
Lattes: http://lattes.cnpq.br/1360403201088680
OrcID: orcid.org/0000-0001-8232-6722
Researchgate: www.researchgate.net/profile/Alexandre_Santos10
Mendeley:www.mendeley.com/profiles/alexandre-dos-santos6/

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### Error in spatialreg::predict.sarlm: unknown mismatch. please report this bug

Sat, 07/06/2019 - 18:52
Dear All,

I am trying to do a out-of-sample prediction using predict.sarlm but I am
getting the 'unknown mismatch' error from predict.sarlm.

I have a training and a testing spatial points dataframe (train_spdf and
test_spdf) and I know that I would require a listw object that contains the
neighbours of the training set for fitting the model and a listw object
that contains the neighbours of training + testing set for predicting.

Here is how I have created the listw object for predicting:
all_coords <- rbind(train_spdf@coords, test_spdf@coords)
col.rel.nb <- graph2nb(relativeneigh(all_coords), sym=TRUE)
test.listw <- nb2listw(col.rel.nb)

It seems like there are at least 2 ways one can create the listw object for
model fitting.

First Way:
nb <- graph2nb(relativeneigh(train_spdf@coords), sym=TRUE)
train.listw <- nb2listw(nb)

Second Way:
train_index <- rep(1:214) # the first 214 observations should be my
training set since that was how I rbind-ed it in all_coords object
train.listw <- subset(test.listw, 1:nrow(all_coords) %in% train_index)

Eventually, the train.listw and test.listw object will be used as follows:

model <- spatialreg::lagsarlm(myformula, data=train_spdf@data,
train.listw, type="mixed",zero.policy = T)

pred <- spatialreg::predict.sarlm(model, test_spdf@data,
test.listw,zero.policy = T)

When I did it the first way, I obtained an error and warning messages from
the predict.sarlm function:
Error in spatialreg::predict.sarlm(model, final_test_spdf@data, test.listw,
:
unknown mismatch. please report this bug
1: In spatialreg::predict.sarlm(model, final_test_spdf@data, test.listw,  :
some region.id are both in data and newdata
2: In colnames(Xs.not.lagged) != colnames(Xo) :
longer object length is not a multiple of shorter object length

When I did it the second way,  I obtained warning messages from the
lagsarlm function:
Warning message:
In spatialreg::lagsarlm(as.formula(rest_formula), data =
final_train_spdf@data,  :
Aliased variables found:
A lag.B lag.C lag.D lag. E lag.F lag.G lag.H lag.I lag.J lag.K lag.L lag.M
lag.N lag.O lag.P  [... truncated]

And then the same error and warning messages as the first way when using
the predict.sarlm function:
Error in spatialreg::predict.sarlm(model2, final_test_spdf@data,
test.listw,  :
unknown mismatch. please report this bug
1: In spatialreg::predict.sarlm(model2, final_test_spdf@data, test.listw,  :
some region.id are both in data and newdata
2: In colnames(Xs.not.lagged) != colnames(Xo) :
longer object length is not a multiple of shorter object length

As you can see, either method would have me run into the 'unknown mismatch'
error in predict.sarlm.

Here I have 2 questions:
1. Which way should be the right way of creating the listw object for
fitting the model?
2. How can I solve the 'unknown mismatch' error?

Would appreciate any advice! Thank you!

[[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### adehabitat: Working with enfa() function using *asc files

Thu, 07/04/2019 - 19:49
Dear R-Sig-Geo Members,

?????? Many years ago I used enfa() function of adehabitat package with my
*asc files with environmental variables and my script works very well,
but now adehabitat is deprecated and there are many versions of
adehabitat. I install the adehabitatHS but for doesn't work with my *asc
files.

?????? My objective was create a data2enfa object and for this I used
manually all the functions in an old adehabitat package (kasc2df,
import.asc, as.kasc and data2enfa) but doesn't work too. In my script I
make:

#Create enfa analysis object
pinkdata <- data2enfa(as.kasc(list(tmax=import.asc("tmax.asc") ,
tmin=import.asc("tmin.asc"))), pink.sub.pnt@coords)
#

??Error in count.points(pts, kasc) :
?? w should inherit the class SpatialPixels

?????? I need to create a data2enfa object using *asc files as tab and
organism position pink.sub.pnt@coords as pr in enfa1 <-
enfa(dudi.pca(tab), pr,?? scannf = FALSE).

?????? Please, any member that work with *asc files and ENFA help me?

Alexandre

--
======================================================================
Alexandre dos Santos
Prote????o Florestal
IFMT - Instituto Federal de Educa????o, Ci??ncia e Tecnologia de Mato Grosso
Campus C??ceres
Caixa Postal 244
Avenida dos Ramires, s/n
Bairro: Distrito Industrial
C??ceres - MT                      CEP: 78.200-000
Fone: (+55) 65 99686-6970 (VIVO) (+55) 65 3221-2674 (FIXO)

[hidden email]
Lattes: http://lattes.cnpq.br/1360403201088680
OrcID: orcid.org/0000-0001-8232-6722
Researchgate: www.researchgate.net/profile/Alexandre_Santos10
Mendeley:www.mendeley.com/profiles/alexandre-dos-santos6/

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### Re: How to reshape a LINESTRING sfc into equal length segments

Thu, 07/04/2019 - 11:33
Thank you very much for your answer. I tried to look into st_segmentize but
I don't think I can use it in this case (or I'm not able to). This is a
little example to show what I mean:

library(sf)
my_sf <- st_sf(a = c("1", "2"), geom = st_sfc(st_linestring(rbind(c(0, 0),
c(0, 1))), st_linestring(rbind(c(0, 1), c(1, 1)))), crs = 3003)

The sf object is formed by two touching linestring geometries of length 1m.
I would like to refactor the sf object into segments whose length is
approximately equal to 0.4m (I choose this value just for this example).
Since the two segments are touching each other, I'd like that the result
will be a new sf object with 5 geometries: 1) a linestring from (0, 0) to
(0, 0.4); 2) a linestring from (0, 0.4) to (0, 0.8); 3) a linestring from
(0, 0.8) to (0, 1) and from (0, 1) to (0.2, 1); 4) a linestring from (0.2,
1) to (0.6, 1) and 5) a linestring from (0.6, 1) to (1, 1).

Even if I use st_segmentize, the split of each segment does not consider
the neighbouring segments.

I'm sure that in a real dataset this is much more difficult and complicated
than this example since
1) it could happend that for one boundary point there are more than one
touching segments and
2) the total length of the network will never be an exact multiple of the
segment distance which implies that there will be some residuals
so, maybe, I'm just asking something unreasonable. I think that any
reasonable refactoring of the sfc of linestrings is fine for my
application.

Il giorno gio 4 lug 2019 alle ore 11:56 Edzer Pebesma <
[hidden email]> ha scritto:

> You may want to look into sf::st_segmentize.
>
> On 7/3/19 10:58 AM, Andrea Gilardi wrote:
> > Hi everyone. This is my first mail to this mailing list so excuse me if
> I'm
> > doing something wrong. I was wondering if it's possible to reshape a
> > LINESTRING sfc of all connected segments into equal length segments.
> >
> > I tried to explain a little bit on why I want to do that here:
> >
> https://gis.stackexchange.com/questions/327376/how-to-divide-linestring-spatial-network-into-equal-length-segments-using-r-sf
> >
> > I'm pretty sure that I'm overcomplicating something simple and, in that
> > case, could you point me in the right direction?
> >
> > Thank you in advance
> > Andrea
> >
> >       [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > R-sig-Geo mailing list
> > [hidden email]
> > https://stat.ethz.ch/mailman/listinfo/r-sig-geo
> >
>
> --
> Edzer Pebesma
> Institute for Geoinformatics
> Heisenbergstrasse 2, 48151 Muenster, Germany
> Phone: +49 251 8333081
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>
[[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### Re: How to reshape a LINESTRING sfc into equal length segments

Thu, 07/04/2019 - 04:56
You may want to look into sf::st_segmentize.

On 7/3/19 10:58 AM, Andrea Gilardi wrote:
> Hi everyone. This is my first mail to this mailing list so excuse me if I'm
> doing something wrong. I was wondering if it's possible to reshape a
> LINESTRING sfc of all connected segments into equal length segments.
>
> I tried to explain a little bit on why I want to do that here:
> https://gis.stackexchange.com/questions/327376/how-to-divide-linestring-spatial-network-into-equal-length-segments-using-r-sf
>
> I'm pretty sure that I'm overcomplicating something simple and, in that
> case, could you point me in the right direction?
>
> Andrea
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
> --
Edzer Pebesma
Institute for Geoinformatics
Heisenbergstrasse 2, 48151 Muenster, Germany
Phone: +49 251 8333081

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### How to reshape a LINESTRING sfc into equal length segments

Wed, 07/03/2019 - 03:58
Hi everyone. This is my first mail to this mailing list so excuse me if I'm
doing something wrong. I was wondering if it's possible to reshape a
LINESTRING sfc of all connected segments into equal length segments.

I tried to explain a little bit on why I want to do that here:
https://gis.stackexchange.com/questions/327376/how-to-divide-linestring-spatial-network-into-equal-length-segments-using-r-sf

I'm pretty sure that I'm overcomplicating something simple and, in that
case, could you point me in the right direction?

Andrea

[[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

### Re: Running huge dataset with dnearneigh

Tue, 07/02/2019 - 04:20
for a geomarketing case.

Roger

On Tue, 2 Jul 2019, Roger Bivand wrote:

> On Tue, 2 Jul 2019, Jiawen Ng wrote:
>
>>  Dear Roger,
>>
>>
>>  I am just exploring the aspect of geodemographics in store locations.
>>  There
>>  are many factors that can be considered, as you have highlighted!
>
> OK, so I suggest choosing a modest sized case until a selection of working
> models emerges. Once you reach that stage, you can return to scaling up. I
> think you need much more data on the customer behaviour around the stores you
> use to train your models, particularly customer flows associated with actual
> purchases. Firms used to do this through loyalty programmes and cards, but
> this data is not open, so you'd need proxies which say city bikes will not
> give you.
>
> Geodemographics (used for direct mailing as a marketing tool) have largely
> been eclipsed by profiling in social media with the exception of segments
> without social media profiles. This is because postcode or OA profiling is
> often too noisy and so is expensive because there are many false hits. Retail
> is interesting but very multi-faceted, but some personal services are more
> closely related to population as they are hard to digitise.
>
> Hope this helps,
>
> Roger
>
>>
>>  Thank you so much for taking the time to write back to me! I will study
>>  and
>>
>>  Jiawen
>>
>>  On Mon, 1 Jul 2019 at 19:12, Roger Bivand <[hidden email]> wrote:
>>
>>>  On Mon, 1 Jul 2019, Jiawen Ng wrote:
>>>
>>>>  Dear Roger,
>>>>
>>>>  Thank you so much for your detailed response and pointing out potential
>>>>  pitfalls! It has prompted me to re-evalutate my approach.
>>>>
>>>>  Here is the context: I have some stores' sales data (this is my training
>>>>  set of 214 points), I would like to find out where best to set up new
>>>>  stores in UK. I am using a geodemographics approach to do this: Perform
>>>>  a
>>>>  regression of sales against census data, then predict sales on UK output
>>>>  areas (by centroids) and finally identify new areas with
>>>>  location-allocation models. As the stores are points, this has led me to
>>>>  define UK output areas by its population-weighted centroids, thus
>>>  resulting
>>>>  in the prediction by points rather than by areas. Tests (like moran's I
>>>  and
>>>>  lagrange multiplier) for spatial relationships among the points in my
>>>>  training set were significant hence this has led me to implement some
>>>>  spatial models (specifically spatial lag, error and durbin models) to
>>>>  account for the spatial relationships in the data.
>>>
>>>  I'm afraid that my retail geography is not very up to date, but also that
>>>  your approach is most unlikely to yield constructive results.
>>>
>>>  Most retail stores are organised in large chains, so optimise costs
>>>  between wholesale and retail. Independent retail stores depend crucially
>>>  on access to wholesale stores, so anyway cannot locate without regard to
>>>  supply costs. Some service activities without wholesale dependencies are
>>>  less tied.
>>>
>>>  Most chains certainly behave strategically with regard to each other,
>>>  sometimes locating toe-to-toe to challenge a competing chain
>>>  (Carrefour/Tesco or their local shop variants), sometimes avoiding nearby
>>>  competing chain locations to establish a local monopoly (think
>>>  Hotelling).
>>>
>>>  Population density doesn't express demand, especially unmet demand well
>>>  at
>>>  all. Think food deserts - maybe plenty of people but little disposable
>>>  income. Look at the food desert literature, or the US food stamp
>>>  literature.
>>>
>>>  Finally (all bad news) retail is not only challenged by location shifting
>>>  from high streets to malls, but critically by online shopping, which
>>>  shifts the cost structures one the buyer is engaged at a proposed price
>>>  to
>>>  logistics, to complete the order at the highest margin including returns.
>>>  That only marginally relates to population density.
>>>
>>>  So you'd need more data than you have, a model that explicitly handles
>>>  competition between chains as well as market gaps, and some way of
>>>  handling online leakage to move forward.
>>>
>>>  If population density was a proxy for accessibility (most often it
>>>  isn't),
>>>  it might look like the beginnings of a model, but most often we don't
>>>  know
>>>  what bid-rent surfaces look like, and then, most often different
>>>  activities sort differently across those surfaces.
>>>
>>>>
>>>>  I am quite unsettled and unclear as to which neighbourhood definition to
>>>  go
>>>>  for actually. I thought of IDW at first as I thought this would
>>>>  summarise
>>>>  each point's relationship with their neighbours very precisely thus
>>>  making
>>>>  the predictions more accurate. Upon your advice (don't use IDW or other
>>>>  general weights for predictions), I decided not to use IDW, and changed
>>>  it
>>>>  to dnearneigh instead (although now I am questioning myself on the
>>>>  definition of what is meant by general weights. Perhaps I am
>>>  understanding
>>>>  the definition of general weights wrong, if dnearneigh is still
>>>  considered
>>>>  to be a 'general weights' method) Why is the use of IDW not advisable
>>>>  however? Is it due to computational reasons? Also, why would having
>>>>  thousands of neighbours be making no sense? Apologies for asking so many
>>>>  questions, I'd just like to really understand the concepts!
>>>>
>>>
>>>  The model underlying spatial regressions using neighbours tapers
>>>  dependency as the pairwise elements of (I - \rho W)^{-1} (conditional)
>>>  and
>>>  [(I - \rho W) (I - \rho W')]^{-1} (see Wall 2004). These are NxN dense
>>>  matrices. (I - \rho W) is typically sparse, and under certain conditions
>>>  leads to (I - \rho W)^{-1} = \sum_{i=0}^{\inf} \rho^i W^i, the sum of a
>>>  power series in \rho and W. \rho is typically upward bounded < 1, so
>>>  \rho^i declines as i increases. This dampens \rho^i W^i, so that i
>>>  influences j less and less with increasing i. So in the general case IDW
>>>  is simply replicating what simple contiguity gives you anyway. So the
>>>  sparser W is (within reason), the better. Unless you really know that the
>>>  physics, chemistry or biology of your system give you a known systematic
>>>  relationship like IDW, you may as well stay with contiguity.
>>>
>>>  However, this isn't any use in solving a retail location problem at all.
>>>
>>>>  I believe that both the train and test set has varying intensities. I
>>>>  was
>>>>  weighing the different neighbourhood methods: dnearneigh, knearneigh,
>>>  using
>>>>  IDW etc. and I felt like each method would have its disadvantages -- its
>>>>  difficult to pinpoint which neighbourhood definition would be best. If
>>>  one
>>>>  were to go for knearneigh for example, results may not be fair due to
>>>>  the
>>>>  inhomogeneity of the points -- for instance, point A's nearest
>>>>  neighbours
>>>>  may be within a few hundreds of kilometres while point B's nearest
>>>>  neighbours may be in the thousands. I feel like the choice of any
>>>>  neighbourhood definition can be highly debateable... What do you think?
>>>>
>>>
>>>  When in doubt use contiguity for polygons and similar graph based methods
>>>  for points. Try to keep the graphs planar (as few intersecting edges as
>>>  possible - rule of thumb).
>>>
>>>
>>>>  After analysing my problem again, I think that predicting by output
>>>>  areas
>>>>  (points) would be best for my case as I would have to make use of the
>>>>  population data after building the model. Interpolating census data of
>>>  the
>>>>  output area (points) would cause me to lose that information.
>>>>
>>>
>>>  Baseline, this is not going anywhere constructive, and simply approaching
>>>  retail location in this way is unhelpful - there is far too little
>>>
>>>  If you really must, first find a fully configured retail model with the
>>>  complete data set needed to replicate the results achieved, and use that
>>>  to benchmark how far your approach succeeds in reaching a similar result
>>>  for that restricted area. I think that you'll find that the retail model
>>>  is much more successful, but if not, there is less structure in
>>>  contemporary retail than I though.
>>>
>>>  Best wishes,
>>>
>>>  Roger
>>>
>>>>  Thank you for the comments and the advice so far,  I would greatly
>>>  welcome
>>>>
>>>>  Thank you so much once again!
>>>>
>>>>  Jiawen
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>  On Sun, 30 Jun 2019 at 16:57, Roger Bivand <[hidden email]> wrote:
>>>>
>>>>>  On Sat, 29 Jun 2019, Jiawen Ng wrote:
>>>>>
>>>>>>  Dear Roger,
>>>>>
>>>>>  Postings go to the whole list ...
>>>>>
>>>>>>
>>>>>>  How can we deal with a huge dataset when using dnearneigh?
>>>>>>
>>>>>
>>>>>  First, why distance neighbours? What is the support of the data, point
>>>  or
>>>>>  polygon? If polygon, contiguity neighbours are preferred. If not, and
>>>  the
>>>>>  intensity of observations is similar across the whole area, distance
>>>>>  may
>>>>>  be justified, but if the intensity varies, some observations will have
>>>>>  very many neighbours. In that case, unless you have a clear ecological
>>>  or
>>>>>  environmental reason for knowing that a known distance threshold binds,
>>>  it
>>>>>  is not a good choice.
>>>>>
>>>>>>  Here is my code:
>>>>>>
>>>>>>  d <- dnearneigh(spdf,0, 22000)
>>>>>>  all_listw <- nb2listw(d, style = "W")
>>>>>>
>>>>>>  where the spdf object is in the british national grid CRS:
>>>>>>  +init=epsg:27700, with 227,973 observations/points. The distance of
>>>>>  22,000
>>>>>>  was decided by a training set that had 214 observations and the spdf
>>>>>  object
>>>>>>  contains both the training set and the testing set.
>>>>>>
>>>>>
>>>>>  This is questionable. You train on 214 observations - do their areal
>>>>>  intensity match those of the whole data set? If chosen at random, you
>>>  run
>>>>>  into the spatial sampling problems discussed in:
>>>>>
>>>>>
>>>>>
>>>  https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author
>>>>>
>>>>>  Are 214 observations for training representative of 227,973 prediction
>>>>>  sites? Do you only have observations on the response for 214, and an
>>>>>  unobserved response otherwise? What are the data, what are you trying
>>>>>  to
>>>>>  do and why? This is not a sensible setting for models using weights
>>>>>  matrices for prediction (I think), because we do not have estimates of
>>>  the
>>>>>  prediction error in general.
>>>>>
>>>>>>  I am using a Mac, with a processor of 2.3 GHz Intel Core i5 and 8 GB
>>>>>>  memory. My laptop showed that when dnearneigh command was run on all
>>>>>>  observations, around 6.9 out of 8GB was used by the rsession and that
>>>  the
>>>>>>  %CPU used by the rsession was stated to be around 98%, although
>>>>>>  another
>>>>>>  indicator showed that my computer was around 60% idle. After running
>>>  the
>>>>>>  command for a day, rstudio alerted me that the connection to the
>>>  rsession
>>>>>>  could not be established, so I aborted the entire process altogether.
>>>>>>  I
>>>>>>  think the problem here may be the size of the dataset and perhaps the
>>>>>>  limitations of my laptop specs.
>>>>>>
>>>>>
>>>>>  On planar data, there is no good reason for this, as each observation
>>>>>  is
>>>>>  treated separately, finding and sorting distances, and choosing those
>>>>>  under the threshold. It will undoubtedly slow if there are more than a
>>>  few
>>>>>  neighbours within the threshold, but I already covered the
>>>>>  of defining neighbours in that way.
>>>>>
>>>>>  Using an rtree might help, but you get hit badly if there are many
>>>>>  neighbours within the threshold you have chosen anyway.
>>>>>
>>>>>  On most 8GB hardware and modern OS, you do not have more than 3-4GB for
>>>>>  work. So something was swapping on your laptop.
>>>>>
>>>>>>  Do you have any advice on how I can go about making a neighbours list
>>>>>  with
>>>>>>  dnearneigh for 227,973 observations in a successful and efficient way?
>>>>>>  Also, would you foresee any problems in the next steps, especially
>>>  when I
>>>>>>  will be using the neighbourhood listw object as an input in fitting
>>>>>>  and
>>>>>>  predicting using the spatial lag/error models? (see code below)
>>>>>>
>>>>>>  model <-  spatialreg::lagsarlm(rest_formula, data=train, train_listw)
>>>>>>  model_pred <- spatialreg::predict.sarlm(model, test, all_listw)
>>>>>>
>>>>>
>>>>>  Why would using a spatial lag model make sense? Why are you suggesting
>>>>>  this model, do you have a behavioural for why only the spatially lagged
>>>>>  response should be included?
>>>>>
>>>>>  Why do you think that this is sensible? You are predicting 1000 times
>>>  for
>>>>>  each observation - this is not what the prediction methods are written
>>>>>  for. Most involve inverting an nxn inverse matrix - did you refer to
>>>>>  Goulard et al. (2017) to get a good understanding of the underlying
>>>>>  methods?
>>>>>
>>>>>>  I think the predicting part may take some time, since my test set
>>>>>  consists
>>>>>>  of 227,973 - 214 observations = 227,759 observations.
>>>>>>
>>>>>>  Here are some solutions that I have thought of:
>>>>>>
>>>>>>  1. Interpolate the test set point data of 227,759 observations over a
>>>>>  more
>>>>>>  manageable spatial pixel dataframe with cell size of perhaps 10,000m
>>>>>>  by
>>>>>>  10,000m which would give me around 4900 points. So instead of 227,759
>>>>>>  observations, I can make the listw object based on just 4900 + 214
>>>>>  training
>>>>>>  points and predict just on 4900 observations.
>>>>>
>>>>>  But what are you trying to do? Are the observations output areas? House
>>>>>  sales? If you are not filling in missing areal units (the Goulard et
>>>>>  al.
>>>>>  case), couldn't you simply use geostatistical methods which seem to
>>>  match
>>>>>  your support better, and can be fitted and can predict using a local
>>>>>  neighbourhood? While you are doing that, you could switch to INLA with
>>>>>  SPDE, which interposes a mesh like the one you suggest. But in that
>>>  case,
>>>>>  beware of the mesh choice issue in:
>>>>>
>>>>>  https://doi.org/10.1080/03610926.2018.1536209
>>>>>
>>>>>>
>>>>>>  2. Get hold of better performance machines through cloud computing
>>>>>>  such
>>>>>  as
>>>>>>  AWS EC2 services and try running the commands and models there.
>>>>>>
>>>>>
>>>>>  What you need are methods, not wasted money on hardware as a service.
>>>>>
>>>>>>  3. Parallel computing using the parallel package from r (although I am
>>>>>  not
>>>>>>  sure whether dnearneigh can be parallelised).
>>>>>>
>>>>>
>>>>>  This could easily be implemented if it was really needed, which I don't
>>>>>  think it is; better methods understanding lets one do more with less.
>>>>>
>>>>>>  I believe option 1 would be the most manageable but I am not sure how
>>>  and
>>>>>>  by how much this would affect the accuracy of the predictions as
>>>>>>  interpolating the dataset would be akin to introducing more
>>>>>>  estimations
>>>>>  in
>>>>>>  the prediction. However, I am also grappling with the trade-off
>>>>>>  between
>>>>>>  accuracy and computation time. Hence, if options 2 and 3 can offer a
>>>>>>  reasonable computation time (1-2 hours) then I would forgo option 1.
>>>>>>
>>>>>>  What do you think? Is it possible to make a neighbourhood listw object
>>>>>  out
>>>>>>  of 227,973 observations efficiently?
>>>>>
>>>>>  Yes, but only if the numbers of neighbours are very small. Look in
>>>  Bivand
>>>>>  et al. (2013) to see the use of some fairly large n, but only with few
>>>>>  neighbours for each observation. You seem to be getting average
>>>  neighbour
>>>>>  counts in the thousands, which makes no sense.
>>>>>
>>>>>>
>>>>>>  Thank you for reading to the end! Apologies for writing a lengthy one,
>>>>>  just
>>>>>>  wanted to fully describe what I am facing, I hope I didn't miss out
>>>>>>  anything crucial.
>>>>>>
>>>>>
>>>>>  Long is OK, but there is no motivation here for why you want to make
>>>  200K
>>>>>  predictions from 200 observations with point support (?) using weights
>>>>>  matrices.
>>>>>
>>>>>  Hope this clarifies,
>>>>>
>>>>>  Roger
>>>>>
>>>>>>  Thank you so much once again!
>>>>>>
>>>>>>  jiawen
>>>>>>
>>>>>>        [[alternative HTML version deleted]]
>>>>>>
>>>>>>  _______________________________________________
>>>>>>  R-sig-Geo mailing list
>>>>>>  [hidden email]
>>>>>>  https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>>>
>>>>>
>>>>>  --
>>>>>  Roger Bivand
>>>>>  Department of Economics, Norwegian School of Economics,
>>>>>  Helleveien 30, N-5045 Bergen, Norway.
>>>>>  voice: +47 55 95 93 55; e-mail: [hidden email]
>>>>>  https://orcid.org/0000-0003-2392-6140
>>>>>
>>>>
>>>>        [[alternative HTML version deleted]]
>>>>
>>>>  _______________________________________________
>>>>  R-sig-Geo mailing list
>>>>  [hidden email]
>>>>  https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>
>>>
>>>  --
>>>  Roger Bivand
>>>  Department of Economics, Norwegian School of Economics,
>>>  Helleveien 30, N-5045 Bergen, Norway.
>>>  voice: +47 55 95 93 55; e-mail: [hidden email]
>>>  https://orcid.org/0000-0003-2392-6140
>>>
>>
>>   [[alternative HTML version deleted]]
>>
>>  _______________________________________________
>>  R-sig-Geo mailing list
>>  [hidden email]
>>  https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>
>
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Roger Bivand
Department of Economics
Norwegian School of Economics
Helleveien 30
N-5045 Bergen, Norway

### Re: Running huge dataset with dnearneigh

Tue, 07/02/2019 - 02:48
On Tue, 2 Jul 2019, Jiawen Ng wrote:

> Dear Roger,
>
>
> I am just exploring the aspect of geodemographics in store locations. There
> are many factors that can be considered, as you have highlighted!

OK, so I suggest choosing a modest sized case until a selection of working
models emerges. Once you reach that stage, you can return to scaling up. I
think you need much more data on the customer behaviour around the stores
you use to train your models, particularly customer flows associated with
actual purchases. Firms used to do this through loyalty programmes and
cards, but this data is not open, so you'd need proxies which say city
bikes will not give you.

Geodemographics (used for direct mailing as a marketing tool) have largely
been eclipsed by profiling in social media with the exception of segments
without social media profiles. This is because postcode or OA profiling is
often too noisy and so is expensive because there are many false hits.
Retail is interesting but very multi-faceted, but some personal services
are more closely related to population as they are hard to digitise.

Hope this helps,

Roger

>
> Thank you so much for taking the time to write back to me! I will study and
>
> Jiawen
>
> On Mon, 1 Jul 2019 at 19:12, Roger Bivand <[hidden email]> wrote:
>
>> On Mon, 1 Jul 2019, Jiawen Ng wrote:
>>
>>> Dear Roger,
>>>
>>> Thank you so much for your detailed response and pointing out potential
>>> pitfalls! It has prompted me to re-evalutate my approach.
>>>
>>> Here is the context: I have some stores' sales data (this is my training
>>> set of 214 points), I would like to find out where best to set up new
>>> stores in UK. I am using a geodemographics approach to do this: Perform a
>>> regression of sales against census data, then predict sales on UK output
>>> areas (by centroids) and finally identify new areas with
>>> location-allocation models. As the stores are points, this has led me to
>>> define UK output areas by its population-weighted centroids, thus
>> resulting
>>> in the prediction by points rather than by areas. Tests (like moran's I
>> and
>>> lagrange multiplier) for spatial relationships among the points in my
>>> training set were significant hence this has led me to implement some
>>> spatial models (specifically spatial lag, error and durbin models) to
>>> account for the spatial relationships in the data.
>>
>> I'm afraid that my retail geography is not very up to date, but also that
>> your approach is most unlikely to yield constructive results.
>>
>> Most retail stores are organised in large chains, so optimise costs
>> between wholesale and retail. Independent retail stores depend crucially
>> on access to wholesale stores, so anyway cannot locate without regard to
>> supply costs. Some service activities without wholesale dependencies are
>> less tied.
>>
>> Most chains certainly behave strategically with regard to each other,
>> sometimes locating toe-to-toe to challenge a competing chain
>> (Carrefour/Tesco or their local shop variants), sometimes avoiding nearby
>> competing chain locations to establish a local monopoly (think Hotelling).
>>
>> Population density doesn't express demand, especially unmet demand well at
>> all. Think food deserts - maybe plenty of people but little disposable
>> income. Look at the food desert literature, or the US food stamp
>> literature.
>>
>> Finally (all bad news) retail is not only challenged by location shifting
>> from high streets to malls, but critically by online shopping, which
>> shifts the cost structures one the buyer is engaged at a proposed price to
>> logistics, to complete the order at the highest margin including returns.
>> That only marginally relates to population density.
>>
>> So you'd need more data than you have, a model that explicitly handles
>> competition between chains as well as market gaps, and some way of
>> handling online leakage to move forward.
>>
>> If population density was a proxy for accessibility (most often it isn't),
>> it might look like the beginnings of a model, but most often we don't know
>> what bid-rent surfaces look like, and then, most often different
>> activities sort differently across those surfaces.
>>
>>>
>>> I am quite unsettled and unclear as to which neighbourhood definition to
>> go
>>> for actually. I thought of IDW at first as I thought this would summarise
>>> each point's relationship with their neighbours very precisely thus
>> making
>>> the predictions more accurate. Upon your advice (don't use IDW or other
>>> general weights for predictions), I decided not to use IDW, and changed
>> it
>>> to dnearneigh instead (although now I am questioning myself on the
>>> definition of what is meant by general weights. Perhaps I am
>> understanding
>>> the definition of general weights wrong, if dnearneigh is still
>> considered
>>> to be a 'general weights' method) Why is the use of IDW not advisable
>>> however? Is it due to computational reasons? Also, why would having
>>> thousands of neighbours be making no sense? Apologies for asking so many
>>> questions, I'd just like to really understand the concepts!
>>>
>>
>> The model underlying spatial regressions using neighbours tapers
>> dependency as the pairwise elements of (I - \rho W)^{-1} (conditional) and
>> [(I - \rho W) (I - \rho W')]^{-1} (see Wall 2004). These are NxN dense
>> matrices. (I - \rho W) is typically sparse, and under certain conditions
>> leads to (I - \rho W)^{-1} = \sum_{i=0}^{\inf} \rho^i W^i, the sum of a
>> power series in \rho and W. \rho is typically upward bounded < 1, so
>> \rho^i declines as i increases. This dampens \rho^i W^i, so that i
>> influences j less and less with increasing i. So in the general case IDW
>> is simply replicating what simple contiguity gives you anyway. So the
>> sparser W is (within reason), the better. Unless you really know that the
>> physics, chemistry or biology of your system give you a known systematic
>> relationship like IDW, you may as well stay with contiguity.
>>
>> However, this isn't any use in solving a retail location problem at all.
>>
>>> I believe that both the train and test set has varying intensities. I was
>>> weighing the different neighbourhood methods: dnearneigh, knearneigh,
>> using
>>> IDW etc. and I felt like each method would have its disadvantages -- its
>>> difficult to pinpoint which neighbourhood definition would be best. If
>> one
>>> were to go for knearneigh for example, results may not be fair due to the
>>> inhomogeneity of the points -- for instance, point A's nearest neighbours
>>> may be within a few hundreds of kilometres while point B's nearest
>>> neighbours may be in the thousands. I feel like the choice of any
>>> neighbourhood definition can be highly debateable... What do you think?
>>>
>>
>> When in doubt use contiguity for polygons and similar graph based methods
>> for points. Try to keep the graphs planar (as few intersecting edges as
>> possible - rule of thumb).
>>
>>
>>> After analysing my problem again, I think that predicting by output areas
>>> (points) would be best for my case as I would have to make use of the
>>> population data after building the model. Interpolating census data of
>> the
>>> output area (points) would cause me to lose that information.
>>>
>>
>> Baseline, this is not going anywhere constructive, and simply approaching
>> retail location in this way is unhelpful - there is far too little
>>
>> If you really must, first find a fully configured retail model with the
>> complete data set needed to replicate the results achieved, and use that
>> to benchmark how far your approach succeeds in reaching a similar result
>> for that restricted area. I think that you'll find that the retail model
>> is much more successful, but if not, there is less structure in
>> contemporary retail than I though.
>>
>> Best wishes,
>>
>> Roger
>>
>>> Thank you for the comments and the advice so far,  I would greatly
>> welcome
>>>
>>> Thank you so much once again!
>>>
>>> Jiawen
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, 30 Jun 2019 at 16:57, Roger Bivand <[hidden email]> wrote:
>>>
>>>> On Sat, 29 Jun 2019, Jiawen Ng wrote:
>>>>
>>>>> Dear Roger,
>>>>
>>>> Postings go to the whole list ...
>>>>
>>>>>
>>>>> How can we deal with a huge dataset when using dnearneigh?
>>>>>
>>>>
>>>> First, why distance neighbours? What is the support of the data, point
>> or
>>>> polygon? If polygon, contiguity neighbours are preferred. If not, and
>> the
>>>> intensity of observations is similar across the whole area, distance may
>>>> be justified, but if the intensity varies, some observations will have
>>>> very many neighbours. In that case, unless you have a clear ecological
>> or
>>>> environmental reason for knowing that a known distance threshold binds,
>> it
>>>> is not a good choice.
>>>>
>>>>> Here is my code:
>>>>>
>>>>> d <- dnearneigh(spdf,0, 22000)
>>>>> all_listw <- nb2listw(d, style = "W")
>>>>>
>>>>> where the spdf object is in the british national grid CRS:
>>>>> +init=epsg:27700, with 227,973 observations/points. The distance of
>>>> 22,000
>>>>> was decided by a training set that had 214 observations and the spdf
>>>> object
>>>>> contains both the training set and the testing set.
>>>>>
>>>>
>>>> This is questionable. You train on 214 observations - do their areal
>>>> intensity match those of the whole data set? If chosen at random, you
>> run
>>>> into the spatial sampling problems discussed in:
>>>>
>>>>
>>>>
>> https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author
>>>>
>>>> Are 214 observations for training representative of 227,973 prediction
>>>> sites? Do you only have observations on the response for 214, and an
>>>> unobserved response otherwise? What are the data, what are you trying to
>>>> do and why? This is not a sensible setting for models using weights
>>>> matrices for prediction (I think), because we do not have estimates of
>> the
>>>> prediction error in general.
>>>>
>>>>> I am using a Mac, with a processor of 2.3 GHz Intel Core i5 and 8 GB
>>>>> memory. My laptop showed that when dnearneigh command was run on all
>>>>> observations, around 6.9 out of 8GB was used by the rsession and that
>> the
>>>>> %CPU used by the rsession was stated to be around 98%, although another
>>>>> indicator showed that my computer was around 60% idle. After running
>> the
>>>>> command for a day, rstudio alerted me that the connection to the
>> rsession
>>>>> could not be established, so I aborted the entire process altogether. I
>>>>> think the problem here may be the size of the dataset and perhaps the
>>>>> limitations of my laptop specs.
>>>>>
>>>>
>>>> On planar data, there is no good reason for this, as each observation is
>>>> treated separately, finding and sorting distances, and choosing those
>>>> under the threshold. It will undoubtedly slow if there are more than a
>> few
>>>> neighbours within the threshold, but I already covered the
>>>> of defining neighbours in that way.
>>>>
>>>> Using an rtree might help, but you get hit badly if there are many
>>>> neighbours within the threshold you have chosen anyway.
>>>>
>>>> On most 8GB hardware and modern OS, you do not have more than 3-4GB for
>>>> work. So something was swapping on your laptop.
>>>>
>>>>> Do you have any advice on how I can go about making a neighbours list
>>>> with
>>>>> dnearneigh for 227,973 observations in a successful and efficient way?
>>>>> Also, would you foresee any problems in the next steps, especially
>> when I
>>>>> will be using the neighbourhood listw object as an input in fitting and
>>>>> predicting using the spatial lag/error models? (see code below)
>>>>>
>>>>> model <-  spatialreg::lagsarlm(rest_formula, data=train, train_listw)
>>>>> model_pred <- spatialreg::predict.sarlm(model, test, all_listw)
>>>>>
>>>>
>>>> Why would using a spatial lag model make sense? Why are you suggesting
>>>> this model, do you have a behavioural for why only the spatially lagged
>>>> response should be included?
>>>>
>>>> Why do you think that this is sensible? You are predicting 1000 times
>> for
>>>> each observation - this is not what the prediction methods are written
>>>> for. Most involve inverting an nxn inverse matrix - did you refer to
>>>> Goulard et al. (2017) to get a good understanding of the underlying
>>>> methods?
>>>>
>>>>> I think the predicting part may take some time, since my test set
>>>> consists
>>>>> of 227,973 - 214 observations = 227,759 observations.
>>>>>
>>>>> Here are some solutions that I have thought of:
>>>>>
>>>>> 1. Interpolate the test set point data of 227,759 observations over a
>>>> more
>>>>> manageable spatial pixel dataframe with cell size of perhaps 10,000m by
>>>>> 10,000m which would give me around 4900 points. So instead of 227,759
>>>>> observations, I can make the listw object based on just 4900 + 214
>>>> training
>>>>> points and predict just on 4900 observations.
>>>>
>>>> But what are you trying to do? Are the observations output areas? House
>>>> sales? If you are not filling in missing areal units (the Goulard et al.
>>>> case), couldn't you simply use geostatistical methods which seem to
>> match
>>>> your support better, and can be fitted and can predict using a local
>>>> neighbourhood? While you are doing that, you could switch to INLA with
>>>> SPDE, which interposes a mesh like the one you suggest. But in that
>> case,
>>>> beware of the mesh choice issue in:
>>>>
>>>> https://doi.org/10.1080/03610926.2018.1536209
>>>>
>>>>>
>>>>> 2. Get hold of better performance machines through cloud computing such
>>>> as
>>>>> AWS EC2 services and try running the commands and models there.
>>>>>
>>>>
>>>> What you need are methods, not wasted money on hardware as a service.
>>>>
>>>>> 3. Parallel computing using the parallel package from r (although I am
>>>> not
>>>>> sure whether dnearneigh can be parallelised).
>>>>>
>>>>
>>>> This could easily be implemented if it was really needed, which I don't
>>>> think it is; better methods understanding lets one do more with less.
>>>>
>>>>> I believe option 1 would be the most manageable but I am not sure how
>> and
>>>>> by how much this would affect the accuracy of the predictions as
>>>>> interpolating the dataset would be akin to introducing more estimations
>>>> in
>>>>> the prediction. However, I am also grappling with the trade-off between
>>>>> accuracy and computation time. Hence, if options 2 and 3 can offer a
>>>>> reasonable computation time (1-2 hours) then I would forgo option 1.
>>>>>
>>>>> What do you think? Is it possible to make a neighbourhood listw object
>>>> out
>>>>> of 227,973 observations efficiently?
>>>>
>>>> Yes, but only if the numbers of neighbours are very small. Look in
>> Bivand
>>>> et al. (2013) to see the use of some fairly large n, but only with few
>>>> neighbours for each observation. You seem to be getting average
>> neighbour
>>>> counts in the thousands, which makes no sense.
>>>>
>>>>>
>>>>> Thank you for reading to the end! Apologies for writing a lengthy one,
>>>> just
>>>>> wanted to fully describe what I am facing, I hope I didn't miss out
>>>>> anything crucial.
>>>>>
>>>>
>>>> Long is OK, but there is no motivation here for why you want to make
>> 200K
>>>> predictions from 200 observations with point support (?) using weights
>>>> matrices.
>>>>
>>>> Hope this clarifies,
>>>>
>>>> Roger
>>>>
>>>>> Thank you so much once again!
>>>>>
>>>>> jiawen
>>>>>
>>>>>       [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> R-sig-Geo mailing list
>>>>> [hidden email]
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>>
>>>>
>>>> --
>>>> Roger Bivand
>>>> Department of Economics, Norwegian School of Economics,
>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>> voice: +47 55 95 93 55; e-mail: [hidden email]
>>>> https://orcid.org/0000-0003-2392-6140
>>>>
>>>
>>>       [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> R-sig-Geo mailing list
>>> [hidden email]
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>
>>
>> --
>> Roger Bivand
>> Department of Economics, Norwegian School of Economics,
>> Helleveien 30, N-5045 Bergen, Norway.
>> voice: +47 55 95 93 55; e-mail: [hidden email]
>> https://orcid.org/0000-0003-2392-6140
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Roger Bivand
Department of Economics
Norwegian School of Economics
Helleveien 30
N-5045 Bergen, Norway

### Re: Running huge dataset with dnearneigh

Mon, 07/01/2019 - 18:41
Dear Roger,

I am just exploring the aspect of geodemographics in store locations. There
are many factors that can be considered, as you have highlighted!

Thank you so much for taking the time to write back to me! I will study and

Jiawen

On Mon, 1 Jul 2019 at 19:12, Roger Bivand <[hidden email]> wrote:

> On Mon, 1 Jul 2019, Jiawen Ng wrote:
>
> > Dear Roger,
> >
> > Thank you so much for your detailed response and pointing out potential
> > pitfalls! It has prompted me to re-evalutate my approach.
> >
> > Here is the context: I have some stores' sales data (this is my training
> > set of 214 points), I would like to find out where best to set up new
> > stores in UK. I am using a geodemographics approach to do this: Perform a
> > regression of sales against census data, then predict sales on UK output
> > areas (by centroids) and finally identify new areas with
> > location-allocation models. As the stores are points, this has led me to
> > define UK output areas by its population-weighted centroids, thus
> resulting
> > in the prediction by points rather than by areas. Tests (like moran's I
> and
> > lagrange multiplier) for spatial relationships among the points in my
> > training set were significant hence this has led me to implement some
> > spatial models (specifically spatial lag, error and durbin models) to
> > account for the spatial relationships in the data.
>
> I'm afraid that my retail geography is not very up to date, but also that
> your approach is most unlikely to yield constructive results.
>
> Most retail stores are organised in large chains, so optimise costs
> between wholesale and retail. Independent retail stores depend crucially
> on access to wholesale stores, so anyway cannot locate without regard to
> supply costs. Some service activities without wholesale dependencies are
> less tied.
>
> Most chains certainly behave strategically with regard to each other,
> sometimes locating toe-to-toe to challenge a competing chain
> (Carrefour/Tesco or their local shop variants), sometimes avoiding nearby
> competing chain locations to establish a local monopoly (think Hotelling).
>
> Population density doesn't express demand, especially unmet demand well at
> all. Think food deserts - maybe plenty of people but little disposable
> income. Look at the food desert literature, or the US food stamp
> literature.
>
> Finally (all bad news) retail is not only challenged by location shifting
> from high streets to malls, but critically by online shopping, which
> shifts the cost structures one the buyer is engaged at a proposed price to
> logistics, to complete the order at the highest margin including returns.
> That only marginally relates to population density.
>
> So you'd need more data than you have, a model that explicitly handles
> competition between chains as well as market gaps, and some way of
> handling online leakage to move forward.
>
> If population density was a proxy for accessibility (most often it isn't),
> it might look like the beginnings of a model, but most often we don't know
> what bid-rent surfaces look like, and then, most often different
> activities sort differently across those surfaces.
>
> >
> > I am quite unsettled and unclear as to which neighbourhood definition to
> go
> > for actually. I thought of IDW at first as I thought this would summarise
> > each point's relationship with their neighbours very precisely thus
> making
> > the predictions more accurate. Upon your advice (don't use IDW or other
> > general weights for predictions), I decided not to use IDW, and changed
> it
> > to dnearneigh instead (although now I am questioning myself on the
> > definition of what is meant by general weights. Perhaps I am
> understanding
> > the definition of general weights wrong, if dnearneigh is still
> considered
> > to be a 'general weights' method) Why is the use of IDW not advisable
> > however? Is it due to computational reasons? Also, why would having
> > thousands of neighbours be making no sense? Apologies for asking so many
> > questions, I'd just like to really understand the concepts!
> >
>
> The model underlying spatial regressions using neighbours tapers
> dependency as the pairwise elements of (I - \rho W)^{-1} (conditional) and
> [(I - \rho W) (I - \rho W')]^{-1} (see Wall 2004). These are NxN dense
> matrices. (I - \rho W) is typically sparse, and under certain conditions
> leads to (I - \rho W)^{-1} = \sum_{i=0}^{\inf} \rho^i W^i, the sum of a
> power series in \rho and W. \rho is typically upward bounded < 1, so
> \rho^i declines as i increases. This dampens \rho^i W^i, so that i
> influences j less and less with increasing i. So in the general case IDW
> is simply replicating what simple contiguity gives you anyway. So the
> sparser W is (within reason), the better. Unless you really know that the
> physics, chemistry or biology of your system give you a known systematic
> relationship like IDW, you may as well stay with contiguity.
>
> However, this isn't any use in solving a retail location problem at all.
>
> > I believe that both the train and test set has varying intensities. I was
> > weighing the different neighbourhood methods: dnearneigh, knearneigh,
> using
> > IDW etc. and I felt like each method would have its disadvantages -- its
> > difficult to pinpoint which neighbourhood definition would be best. If
> one
> > were to go for knearneigh for example, results may not be fair due to the
> > inhomogeneity of the points -- for instance, point A's nearest neighbours
> > may be within a few hundreds of kilometres while point B's nearest
> > neighbours may be in the thousands. I feel like the choice of any
> > neighbourhood definition can be highly debateable... What do you think?
> >
>
> When in doubt use contiguity for polygons and similar graph based methods
> for points. Try to keep the graphs planar (as few intersecting edges as
> possible - rule of thumb).
>
>
> > After analysing my problem again, I think that predicting by output areas
> > (points) would be best for my case as I would have to make use of the
> > population data after building the model. Interpolating census data of
> the
> > output area (points) would cause me to lose that information.
> >
>
> Baseline, this is not going anywhere constructive, and simply approaching
> retail location in this way is unhelpful - there is far too little
>
> If you really must, first find a fully configured retail model with the
> complete data set needed to replicate the results achieved, and use that
> to benchmark how far your approach succeeds in reaching a similar result
> for that restricted area. I think that you'll find that the retail model
> is much more successful, but if not, there is less structure in
> contemporary retail than I though.
>
> Best wishes,
>
> Roger
>
> > Thank you for the comments and the advice so far,  I would greatly
> welcome
> > and appreciate additional feedback!
> >
> > Thank you so much once again!
> >
> > Jiawen
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sun, 30 Jun 2019 at 16:57, Roger Bivand <[hidden email]> wrote:
> >
> >> On Sat, 29 Jun 2019, Jiawen Ng wrote:
> >>
> >>> Dear Roger,
> >>
> >> Postings go to the whole list ...
> >>
> >>>
> >>> How can we deal with a huge dataset when using dnearneigh?
> >>>
> >>
> >> First, why distance neighbours? What is the support of the data, point
> or
> >> polygon? If polygon, contiguity neighbours are preferred. If not, and
> the
> >> intensity of observations is similar across the whole area, distance may
> >> be justified, but if the intensity varies, some observations will have
> >> very many neighbours. In that case, unless you have a clear ecological
> or
> >> environmental reason for knowing that a known distance threshold binds,
> it
> >> is not a good choice.
> >>
> >>> Here is my code:
> >>>
> >>> d <- dnearneigh(spdf,0, 22000)
> >>> all_listw <- nb2listw(d, style = "W")
> >>>
> >>> where the spdf object is in the british national grid CRS:
> >>> +init=epsg:27700, with 227,973 observations/points. The distance of
> >> 22,000
> >>> was decided by a training set that had 214 observations and the spdf
> >> object
> >>> contains both the training set and the testing set.
> >>>
> >>
> >> This is questionable. You train on 214 observations - do their areal
> >> intensity match those of the whole data set? If chosen at random, you
> run
> >> into the spatial sampling problems discussed in:
> >>
> >>
> >>
> https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author
> >>
> >> Are 214 observations for training representative of 227,973 prediction
> >> sites? Do you only have observations on the response for 214, and an
> >> unobserved response otherwise? What are the data, what are you trying to
> >> do and why? This is not a sensible setting for models using weights
> >> matrices for prediction (I think), because we do not have estimates of
> the
> >> prediction error in general.
> >>
> >>> I am using a Mac, with a processor of 2.3 GHz Intel Core i5 and 8 GB
> >>> memory. My laptop showed that when dnearneigh command was run on all
> >>> observations, around 6.9 out of 8GB was used by the rsession and that
> the
> >>> %CPU used by the rsession was stated to be around 98%, although another
> >>> indicator showed that my computer was around 60% idle. After running
> the
> >>> command for a day, rstudio alerted me that the connection to the
> rsession
> >>> could not be established, so I aborted the entire process altogether. I
> >>> think the problem here may be the size of the dataset and perhaps the
> >>> limitations of my laptop specs.
> >>>
> >>
> >> On planar data, there is no good reason for this, as each observation is
> >> treated separately, finding and sorting distances, and choosing those
> >> under the threshold. It will undoubtedly slow if there are more than a
> few
> >> neighbours within the threshold, but I already covered the
> >> of defining neighbours in that way.
> >>
> >> Using an rtree might help, but you get hit badly if there are many
> >> neighbours within the threshold you have chosen anyway.
> >>
> >> On most 8GB hardware and modern OS, you do not have more than 3-4GB for
> >> work. So something was swapping on your laptop.
> >>
> >>> Do you have any advice on how I can go about making a neighbours list
> >> with
> >>> dnearneigh for 227,973 observations in a successful and efficient way?
> >>> Also, would you foresee any problems in the next steps, especially
> when I
> >>> will be using the neighbourhood listw object as an input in fitting and
> >>> predicting using the spatial lag/error models? (see code below)
> >>>
> >>> model <-  spatialreg::lagsarlm(rest_formula, data=train, train_listw)
> >>> model_pred <- spatialreg::predict.sarlm(model, test, all_listw)
> >>>
> >>
> >> Why would using a spatial lag model make sense? Why are you suggesting
> >> this model, do you have a behavioural for why only the spatially lagged
> >> response should be included?
> >>
> >> Why do you think that this is sensible? You are predicting 1000 times
> for
> >> each observation - this is not what the prediction methods are written
> >> for. Most involve inverting an nxn inverse matrix - did you refer to
> >> Goulard et al. (2017) to get a good understanding of the underlying
> >> methods?
> >>
> >>> I think the predicting part may take some time, since my test set
> >> consists
> >>> of 227,973 - 214 observations = 227,759 observations.
> >>>
> >>> Here are some solutions that I have thought of:
> >>>
> >>> 1. Interpolate the test set point data of 227,759 observations over a
> >> more
> >>> manageable spatial pixel dataframe with cell size of perhaps 10,000m by
> >>> 10,000m which would give me around 4900 points. So instead of 227,759
> >>> observations, I can make the listw object based on just 4900 + 214
> >> training
> >>> points and predict just on 4900 observations.
> >>
> >> But what are you trying to do? Are the observations output areas? House
> >> sales? If you are not filling in missing areal units (the Goulard et al.
> >> case), couldn't you simply use geostatistical methods which seem to
> match
> >> your support better, and can be fitted and can predict using a local
> >> neighbourhood? While you are doing that, you could switch to INLA with
> >> SPDE, which interposes a mesh like the one you suggest. But in that
> case,
> >> beware of the mesh choice issue in:
> >>
> >> https://doi.org/10.1080/03610926.2018.1536209
> >>
> >>>
> >>> 2. Get hold of better performance machines through cloud computing such
> >> as
> >>> AWS EC2 services and try running the commands and models there.
> >>>
> >>
> >> What you need are methods, not wasted money on hardware as a service.
> >>
> >>> 3. Parallel computing using the parallel package from r (although I am
> >> not
> >>> sure whether dnearneigh can be parallelised).
> >>>
> >>
> >> This could easily be implemented if it was really needed, which I don't
> >> think it is; better methods understanding lets one do more with less.
> >>
> >>> I believe option 1 would be the most manageable but I am not sure how
> and
> >>> by how much this would affect the accuracy of the predictions as
> >>> interpolating the dataset would be akin to introducing more estimations
> >> in
> >>> the prediction. However, I am also grappling with the trade-off between
> >>> accuracy and computation time. Hence, if options 2 and 3 can offer a
> >>> reasonable computation time (1-2 hours) then I would forgo option 1.
> >>>
> >>> What do you think? Is it possible to make a neighbourhood listw object
> >> out
> >>> of 227,973 observations efficiently?
> >>
> >> Yes, but only if the numbers of neighbours are very small. Look in
> Bivand
> >> et al. (2013) to see the use of some fairly large n, but only with few
> >> neighbours for each observation. You seem to be getting average
> neighbour
> >> counts in the thousands, which makes no sense.
> >>
> >>>
> >>> Thank you for reading to the end! Apologies for writing a lengthy one,
> >> just
> >>> wanted to fully describe what I am facing, I hope I didn't miss out
> >>> anything crucial.
> >>>
> >>
> >> Long is OK, but there is no motivation here for why you want to make
> 200K
> >> predictions from 200 observations with point support (?) using weights
> >> matrices.
> >>
> >> Hope this clarifies,
> >>
> >> Roger
> >>
> >>> Thank you so much once again!
> >>>
> >>> jiawen
> >>>
> >>>       [[alternative HTML version deleted]]
> >>>
> >>> _______________________________________________
> >>> R-sig-Geo mailing list
> >>> [hidden email]
> >>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
> >>>
> >>
> >> --
> >> Roger Bivand
> >> Department of Economics, Norwegian School of Economics,
> >> Helleveien 30, N-5045 Bergen, Norway.
> >> voice: +47 55 95 93 55; e-mail: [hidden email]
> >> https://orcid.org/0000-0003-2392-6140
> >>
> >
> >       [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > R-sig-Geo mailing list
> > [hidden email]
> > https://stat.ethz.ch/mailman/listinfo/r-sig-geo
> >
>
> --
> Roger Bivand
> Department of Economics, Norwegian School of Economics,
> Helleveien 30, N-5045 Bergen, Norway.
> voice: +47 55 95 93 55; e-mail: [hidden email]
> https://orcid.org/0000-0003-2392-6140